(54) Flow control in adaptive pipelines

(57) A flow control method for active pipelines segments
macro processes into tasks. Each time a coprocessing
device becomes available, a scheduling algorithm is run to
determine which task is dispatched to the available
coprocessing device for execution. The scheduling
algorithm determines a slack time to meet the deadline for
each of the tasks and ranks the tasks accordingly, with
the tasks having the shorter slack times having the higher
ranking. The highest ranked task that has a buffer
available to read data from and a buffer available to
write data into is dispatched for execution by the
available coprocessing device.
Description

[0001] This invention was made with United States 
Government support under Award No. 70NANB5H1176 
by the National Institute of Standards and Technology, 
Department of Commerce. The United States has certain
rights in the invention.

BACKGROUND OF THE INVENTION 

[0002] The present invention relates to multimedia 
services, and more particularly to flow control in adap- 
tive pipelines to accelerate data processing. 
[0003] Programming functionality at nodes of a net- 
work, as opposed to passive transformation by fixed 
function network elements, is an emerging trend in net- 
working - also known as active networking. Active nodes 
in the active network use available information within the 
network to build adaptive transformation tunnels that en- 
able multiple network data flows to exist simultaneously. 
Nowhere is the need more obvious than in realtime net- 
worked multimedia services like transcoding. 
[0004] Previously the methodology was to develop a 
system in software, determine what were the hot-spots 
that contribute to performance degradation, and then 
develop a coprocessing device to accelerate the hot- 
spots. Field programmable gate arrays (FPGAs) have 
been proposed to address flexible reuse of the co- 
processing device. As point solutions, this methodology 
is fine. However application complexity growth at the 
moment tracks an exponential curve. So it is becoming 
apparent that this methodology serves a very limited ob- 
jective. The network bandwidth is growing at a much 
faster rate than coprocessing speed. 
[0005] Active nodes in the form of adaptive pipelines 
are described in co-pending U.S. Patent Application No. 
09/191,929 filed November 13, 1998 by Raja Neogi entitled
"System for Network Transcoding of Multimedia
Data Flow", and optimized routing of client service re- 
quests through such an active network is described in 
co-pending U.S. Patent Application No. 09/ filed , 1999 
by Raja Neogi entitled "Resource Constrained Routing 
in Active Networks." As described in the '929 Patent Ap- 
plication each macro process, such as video data trans- 
coding, at an active node may be split into smaller tasks, 
typically characterized as processes or threads, to facil- 
itate either virtual or real multiprocessing. For example 
the transcoder macro process shown in Fig. 1 may be 
segmented into three tasks - decoding compressed vid- 
eo data, filtering the uncompressed video data from the 
decoding stage, and encoding the filtered uncom- 
pressed video data from the filtering stage to produce 
transcoded compressed video data. In some cases one 
or more of the tasks may need more compute cycles 
than can be allotted. This results in the breakdown of 
that task into smaller tasks, such as the encoding task 
being broken down into motion estimating and coding 
tasks. Adding tasks results in latency, and all clients
requesting service have a parameter to specify end-to-end
latency for the macro process. When macro process
segmentation results in a segmentation strategy that meets
pipeline latency, but fails to meet end-to-end latency
requirements, sub-optimal parameter values for one or more
of the tasks are chosen. For multitasking, many macro
processes may be running on the active network
simultaneously, or a macro process may be broken into many
tasks, resulting in the total number of tasks generally
exceeding the number of coprocessing devices available.
Therefore a scheduling policy is needed to assure that all
tasks are completed within the given latency requirements.
What is desired is a flow control process in adaptive
pipelines that schedules multiple tasks to meet latency
requirements.

SUMMARY OF THE INVENTION 

[0006] Accordingly the present invention provides a flow
control process for adaptive pipelines that schedules the
tasks that make up the pipelines to assure that latency
requirements are met. When a coprocessing device is
available to accept a task, a scheduler processes all
tasks simultaneously to determine how long data for the
task has been in the queue, how much slack is available to
process the task based upon its time to finish, and
whether the input and output buffers from buffer pools are
ready to be read from and written into, respectively. The
tasks that are ready are ranked, and the task with the
shortest slack time is dispatched to the coprocessing
device for execution. For multiple coprocessors, each time
one of the coprocessors finishes a task and becomes
available, the above algorithm is run to determine the
next task to be processed by that coprocessor.

[0007] The objects, advantages and other novel fea- 
tures of the present invention are apparent from the fol- 
lowing detailed description when read in conjunction 
with the appended claims and attached drawing. 

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF 
THE DRAWING 

[0008] Fig. 1 is an illustrative view of the segmentation
of a macro process into a plurality of tasks according to
the present invention.

[0009] Fig. 2 is a timing diagram for flow control in
adaptive pipelines according to the present invention.
[0010] Fig. 3 is an illustrative view for flow control in
adaptive pipelines according to the present invention.

DETAILED DESCRIPTION OF THE INVENTION 

[0011] Each thread or task of a macro process, as shown in
Fig. 1, is considered to be an atomic image that executes
uninterrupted on a mapped coprocessing device for a
maximum preset time limit. In the television arts the
human eye is sensitive to artifacts that show up at rates
of 30 frames or 60 fields per second. Therefore adaptive
pipelines for video applications are driven with a period
of 33 milliseconds. This guarantees a throughput of 30
frames per second and also imposes a per-task latency of
33 milliseconds. Each macro process or client requested
service is first segmented, as indicated above, into
logical tasks with a process cycle, or time to finish
(TF), and buffer memory requirement for each task. This
preestimation on a logarithmic scale is noted as "network
intelligence" in appropriate tables for each type of
service request, such as transcoding. If a logical task
does not meet the latency criteria, further segmentation
is exercised. This approach is applied hierarchically
until the grain of the task, perhaps optimized for the
coprocessing devices, fits the latency bounds. To assure
meeting the latency bounds, sub-optimum parameters may be
chosen for some tasks. For example, while coding to
generate an MPEG transport stream, one may drop from full
CCIR-601 resolution to SIF resolution. Bounded multi-value
specification of parameters provides flexibility.
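
The hierarchical segmentation described in this paragraph can be
illustrated with a minimal Python sketch. The Task class, the
segment function, and all TF values below are hypothetical and
chosen only for illustration; the sketch assumes that a finer
split of each oversized task is already known.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List

PERIOD_MS = 1000.0 / 30.0  # one frame period at 30 frames/s, about 33.3 ms


@dataclass
class Task:
    name: str
    tf_ms: float  # estimated time to finish (TF), hypothetical value
    subtasks: List[Task] = field(default_factory=list)  # assumed finer split


def segment(task: Task, period_ms: float = PERIOD_MS) -> List[Task]:
    """Split a task hierarchically until every leaf fits in one period."""
    if task.tf_ms <= period_ms:
        return [task]
    if not task.subtasks:
        # No finer split is known; a caller would instead fall back to
        # sub-optimum parameters to reduce TF.
        raise ValueError(f"{task.name} does not fit in {period_ms:.1f} ms")
    leaves: List[Task] = []
    for sub in task.subtasks:
        leaves.extend(segment(sub, period_ms))
    return leaves


transcoder = Task("transcode", 80.0, [
    Task("decode", 20.0),
    Task("filter", 15.0),
    Task("encode", 45.0, [
        Task("motion_estimate", 25.0),
        Task("code", 20.0),
    ]),
])
names = [t.name for t in segment(transcoder)]
print(names)
```

With these assumed values the encoding task exceeds the period and
is split further, while decoding and filtering are kept whole.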

[0012] Referring now to Figs. 2 and 3 an illustrative 
transcoding macro process 10 is shown segmented into 
decoding 12, filtering 14 and encoding 16 threads or 
tasks. A plurality of buffer pools 18-24 store the data
being processed between tasks. The light and dark
squares within each of the buffer pools 18-24 represent 
respectively empty and full buffers, with the size repre- 
sentative of either compressed (narrow in width) or un- 
compressed (wide in width) video data for this transcod- 
ing illustration. Compressed data is received in a buffer 
in the first buffer pool 18, where it is accessed by the 
decoding task 12. The decoding task 12 then outputs 
the uncompressed data into a buffer of the next buffer 
pool 20. Likewise the filtering task 14 reads uncom- 
pressed data from the buffer of the second buffer pool 
20 and writes filtered data into a buffer of the next buffer 
pool 22. Finally the encoding task 16 reads the filtered 
data from the buffer of the third buffer pool 22 and writes 
compressed data into a buffer of the final buffer pool 24. 
The resulting compressed data is returned to the re- 
questing client from the buffer of the final buffer pool 24. 
Each time the data is placed into a buffer of one of the 
buffer pools 18-24 it is marked with a time of allocation 
(TAC) to the buffer pool. 
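
One possible way to model a buffer pool with TAC marking is
sketched below in Python. The Buffer and BufferPool classes and
their method names are assumptions made for illustration and do
not come from the specification; the sketch only shows full
buffers being stamped on allocation and read oldest first.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass
class Buffer:
    data: bytes
    tac: float  # time of allocation (TAC) to the buffer pool


class BufferPool:
    """Fixed number of buffers; full buffers are kept oldest first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.full: Deque[Buffer] = deque()

    def can_read(self) -> bool:
        # A full buffer is ready to be read (RS).
        return len(self.full) > 0

    def can_write(self) -> bool:
        # An empty buffer is ready to be written into (WS).
        return len(self.full) < self.capacity

    def write(self, data: bytes, now: float) -> None:
        if not self.can_write():
            raise RuntimeError("no empty buffer in pool")
        self.full.append(Buffer(data, tac=now))  # mark with TAC

    def read(self) -> Buffer:
        if not self.can_read():
            raise RuntimeError("no full buffer in pool")
        return self.full.popleft()  # oldest data first


pool = BufferPool(capacity=2)
pool.write(b"frame0", now=0.0)
pool.write(b"frame1", now=5.0)
was_full = not pool.can_write()
oldest = pool.read()
```

Keeping the full buffers in arrival order means the oldest data is
always at the front, so no search by TAC is needed on a read.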

[0013] Where the number of coprocessing devices is less
than the number of tasks to be performed, each time one of
the coprocessing devices becomes available, all of the
tasks are processed by a scheduling algorithm to determine
which task is dispatched to the available coprocessing
device for execution. The time when the scheduling
algorithm is called is given a current time stamp (CTS).
As shown in Fig. 2 a system clock Cs has a period of K,
such as 33 milliseconds, and each task needs to be
completed within such clock period. A first difference is
calculated to determine the "age" of the data in the
buffer pool, DIFF1 = CTS - TAC. The oldest data in each
buffer pool is the data available for the next task that
accesses data from that buffer pool. The CTS for each task
is subtracted from the real time T for the end of the
current system clock period to determine a second
difference, DIFF2 = T - CTS, the remaining time in the
period for completion of the task. A slack time to meet
the deadline of the end of the period (SMD) is determined
by subtracting the TF (time to finish the task) for that
task from the time remaining, SMD = DIFF2 - TF. The tasks
are then ranked by their SMD values, from lowest value to
highest value. The task at the top of the rank is tested
to see if it is ready for processing, i.e., whether there
is a buffer in the preceding buffer pool that is ready to
be read (RS) and a buffer in the succeeding buffer pool
that is ready to be written into (WS). The highest ranked
task, i.e., the task with the smallest SMD value, that is
ready is then dispatched to the available coprocessing
device for execution using the oldest data in the input
buffer pool as determined above by DIFF1. This scheduling
algorithm is repeated each time one of the coprocessing
devices completes a task and becomes available.
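
The scheduling steps of this paragraph can be sketched in Python
as follows. The Task fields, the pick_task function, and the
numeric values are hypothetical; the sketch assumes all times are
in milliseconds and that each task sees the TAC values of the full
buffers in its preceding pool and a count of empty buffers in its
succeeding pool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Task:
    name: str
    tf: float                # time to finish (TF)
    input_tacs: List[float]  # TAC of each full buffer in the preceding pool
    output_free: int         # empty buffers in the succeeding pool


def pick_task(tasks: List[Task], cts: float,
              period_end: float) -> Optional[Tuple[Task, float]]:
    """Return the ready task with the smallest SMD and the TAC of its oldest input."""
    diff2 = period_end - cts                  # DIFF2 = T - CTS
    ranked = sorted(tasks, key=lambda t: diff2 - t.tf)  # SMD = DIFF2 - TF, ascending
    for task in ranked:
        rs = len(task.input_tacs) > 0         # a buffer is ready to be read
        ws = task.output_free > 0             # a buffer is ready to be written
        if rs and ws:
            # Largest DIFF1 = CTS - TAC corresponds to the smallest TAC.
            oldest_tac = min(task.input_tacs)
            return task, oldest_tac
    return None                               # no task is ready


tasks = [
    Task("decode", tf=10.0, input_tacs=[1.0, 3.0], output_free=1),
    Task("filter", tf=8.0, input_tacs=[2.0], output_free=0),
    Task("encode", tf=20.0, input_tacs=[], output_free=2),
]
picked = pick_task(tasks, cts=5.0, period_end=33.0)
```

In this example the encoding task has the smallest slack but no
input buffer to read, so the decoding task, next in rank and
ready, is the one dispatched.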

[0014] The TF for each task is a function of the
parameters associated with the task as contained in the
network intelligence tables. If the task cannot be
completed within the given latency bounds, then a
sub-optimum parameter from the tables is used, reducing
the TF for that task. In this manner the processing is
flexible in order to meet the latency requirements of the
service request. Although the above has been illustrated
with respect to multiple tasks from a single macro
process, more than one macro process may be in process at
any given time, which increases the number of tasks vying
for processing time. The overhead for the task scheduling
is insignificant in comparison to the processing time for
the tasks themselves.
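
The fallback to a sub-optimum parameter can be sketched in Python
as a table lookup. The table layout, the choose_parameter
function, and the TF values are assumptions for illustration only;
the specification does not give the format of the network
intelligence tables. The resolution names are those mentioned
earlier in the description.

```python
from typing import List, Optional, Tuple


def choose_parameter(table: List[Tuple[str, float]],
                     budget_ms: float) -> Optional[Tuple[str, float]]:
    """Pick the first (preferred) entry whose TF fits within the budget.

    `table` is assumed to be ordered from most to least preferred,
    each entry being (parameter, TF in milliseconds).
    """
    for parameter, tf_ms in table:
        if tf_ms <= budget_ms:
            return parameter, tf_ms
    return None  # no parameter value meets the latency bound


# Hypothetical TF estimates for the encoding task at two resolutions.
encode_table = [("CCIR-601", 48.0), ("SIF", 14.0)]
choice = choose_parameter(encode_table, budget_ms=33.0)
```

Here the full-resolution entry exceeds the assumed 33 millisecond
budget, so the lookup falls through to the lower resolution.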

[0015] Thus the present invention provides flow control
in adaptive pipelines to meet latency bound requirements
by determining, each time a coprocessor becomes available,
which waiting task has the shortest slack time within a
given processing period, and processing that task if there
is both a buffer ready to be read from and another ready
to be written into.



Claims

1. A method of flow control in active pipelines having
a plurality of coprocessing devices for processing
data comprising the steps of:

segmenting macro processes input to the active
pipelines, each macro process being segmented into
a plurality of tasks;
when a coprocessing device from the plurality of
coprocessing devices becomes available, scheduling
the task from among the plurality of tasks to be
sent to the available coprocessing device according
to specified criteria;
repeating the scheduling step each time one of the
coprocessing devices becomes available until there
are no more tasks to be processed.

2. The method according to claim 1 wherein the
scheduling step comprises the steps of:

determining a slack time for each task within a
given processing period;
ranking the tasks in order of the slack times, the
shorter slack times having the higher ranks;
dispatching to the available coprocessing device
the task having the highest rank that is also
ready.

3. The method as recited in claim 2 wherein the
determining step comprises the steps of:

obtaining a difference between a time at the end of
the given processing period and the time the
coprocessing device became available; and
obtaining the slack time as the difference between
the difference and a time to finish for the task.

4. The method as recited in claim 3 wherein the
determining step further comprises the steps of:

determining if a buffer having input for the task
is ready to be read from; and
determining if a buffer for the output from the
task is ready to be written into, the task being
ready if both conditions are satisfied.